Language Determination: Natural Language Processing from Scanned Document Images
Abstract
Many documents are available to a computer only as images from paper. However, most natural language processing systems expect their input as character-coded text, which may be difficult or expensive to extract accurately from the page. We describe a method for converting a document image into character shape codes and word shape tokens. We believe that this representation, which is both cheap and robust, is sufficient for many NLP tasks. In this paper, we show that the representation is sufficient for determining which of 23 languages the document is written in, using only a small number of features, with greater than 90% accuracy overall.

1 Introduction

Computational linguists work with texts. Computational linguistic applications range from natural language understanding to information retrieval to machine translation. Such systems usually assume the language of the text that is being processed. However, as corpora become larger and more diverse, this assumption becomes less warranted. Attention is now turning to the issue of determining the language or languages of a text before further processing is done.

Several sources of information for language determination have been tried: short words (Kulikowski 1991, Ingle 1976); n-grams of words (Batchelder 1992); n-grams of characters (Cavnar & Trenkle 1994); diacritics and special characters (Beesley 1988, Newman 1987); syllable characteristics (Mustonen 1965); morphology and syntax (Ziegler 1991). Each of these approaches is promising, although none is completely accurate. More fundamentally, many rely on relatively large amounts of text data and all rely on data in the form of character codes (e.g., ASCII).

In today's world of text-based information, however, not all sources of text will be character coded. Many documents such as incoming faxes, patent applications, and office memos are only accessible on paper.
Processes such as Optical Character Recognition (OCR) have been developed for mapping paper documents into character-coded text. However, for applications like OCR, it is desirable to know the language a document is in before trying to decode its characters.

There appears to be a fundamental Catch-22: natural language processing systems want to be able to work automatically with arbitrary documents, many of which may be available only on paper, and in the process, they minimally need to know which language or languages are present. The algorithms cited above can determine a document's language, but they require a character-coded representation of the text. OCR can produce such a representation, but OCR does not work well unless the language(s) of the document are known. So how can the language of a paper document be determined?

We have developed a method which reliably determines the language or languages of a document image. In this paper, we discuss Roman-alphabet languages such as English, Polish, and Swahili; see Spitz (1994) for a discussion of the determination of Asian-script languages. Our method finesses the problems inherent in mapping from an image to a character-coded representation: we map instead from the image to a shape-based representation. The basal representation is the character shape code, of which there are a small number. These shape codes are aggregated into word shape tokens, which are delimited by white space. From examining these word shape tokens we can determine the language of the document. An example of the transformation from character codes to character shape codes is shown in Figure 1.

Character codes:
Confidence in the international monetary system was shaky enough before last week's action.

Character shape codes:
AxxAiAxxxx ix AAx ixAxxxxAixxxA xxxxAxxg xgxAxx xxx xAxAg xxxxgA AxAxxx AxxA xxxA'x xxAixx.

Figure 1: Character code representation and character shape code representation.
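The mapping illustrated in Figure 1 can be approximated in a few lines of Python. This is only a sketch that starts from character codes, as the figure does; the method described here derives the shape codes from the document image itself. The function names `char_shape_code` and `word_shape_tokens` are hypothetical. The classes A, x, g, and i are read off Figure 1; the treatment of digits, of `p` and `q`, of `j`, and of accented letters is an assumption.

```python
# Sketch of the character-code -> shape-code mapping suggested by Figure 1.
# Names and class membership beyond what Figure 1 shows are assumptions.

# Tall lowercase letters; b, d, f, h, k, l, t all map to "A" in Figure 1.
ASCENDERS = set("bdfhklt")
# g and y map to "g" in Figure 1; p and q are assumed to behave the same way.
DESCENDERS = set("gpqy")


def char_shape_code(ch: str) -> str:
    """Return an approximate shape code for a single character."""
    if ch.isupper() or ch.isdigit() or ch in ASCENDERS:
        return "A"  # capitals as in Figure 1; digits are an assumption
    if ch in DESCENDERS:
        return "g"
    if ch == "i":
        return "i"
    if ch == "j":
        return "j"  # assumption: j gets its own class (not shown in Figure 1)
    if ch.islower():
        return "x"  # x-height letters; diacritics are ignored in this sketch
    return ch       # punctuation is passed through, as with ' in Figure 1


def word_shape_tokens(text: str) -> list[str]:
    """Split on white space and map each word to its word shape token."""
    return ["".join(char_shape_code(c) for c in word) for word in text.split()]


if __name__ == "__main__":
    sentence = ("Confidence in the international monetary system "
                "was shaky enough before last week's action.")
    print(" ".join(word_shape_tokens(sentence)))
```

Running the script on the sentence from Figure 1 gives a quick way to compare the sketch against the shape codes shown there.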
The shape-based representation of a document is proving to be a remarkably rich source of information. While our initial goal has been to use it for language identification, in support of downstream OCR pro-
Similar resources
A new text-mining method for extracting user context information to improve the ranking of search engine results
Today, the importance of text processing and its uses is well known among researchers and students. The amount of textual and documentary material increases day by day, so we need useful ways to store it and retrieve information from it. For example, search engines such as Google, Yahoo, and Bing need to read a great many web documents and retrieve the ones most similar to the user ...
Plagiarism checker for Persian (PCP) texts using hash-based tree representative fingerprinting
With due respect to authors' rights, plagiarism detection is one of the critical problems in the field of text mining that many researchers are interested in. The issue is considered serious in higher academic institutions. There exist language-free tools that do not yield reliable results, since they ignore the special features of each language. Considering the paucit...
What do Journalists do with Documents? Field Notes for Natural Language Processing Researchers
Natural language processing and visualization systems have been proposed to help journalists analyze large sets of documents, but very little has been said on what journalists do with documents in practice. We review a collection of 15 stories completed with the Overview document mining platform, characterizing the source material and reporting tasks. The median document set contained 4,000 doc...
A methodology for document processing: separating text from images
This paper presents a methodology for document processing that separates text paragraphs from images. The methodology is based on the recognition of text characters and words for the efficient separation of text paragraphs from images, keeping their relationships for a possible reconstruction of the original page. The text separation and extraction is based on a hierarchical framing process. The...
Information Processing from Document Images
Analysis of document images for information extraction has become very prominent in the recent past. A wide variety of information that has conventionally been stored on paper is now being converted into electronic form for better storage and intelligent processing. This requires processing documents with image analysis algorithms. Document image analysis differs from the conventional image proces...